bla bla bla
Loading the data sets: fruits and vegetables prices in Nepal, as well as the Consumer Price Index (CPI) for Nepal. The fruits and vegetables table contains the average price for vegetables and fruits on a given day. For the time series analysis this will be aggregated to the month level by taking the average of the average daily prices in a particular month.
The generated tsibble looks like this.
# Loading Datasets.
veggies <- read_csv("data_sources/kalimati_tarkari_dataset.csv",
col_types = cols(Date = col_date(format = "%Y-%m-%d")))
suppressWarnings({
cpi <- read_excel("data_sources/Nepal CPI Trading Economics.xlsx",
col_types = c("date", "numeric"))
})
cpi <- cpi |>
filter(Date >= '2013-08-01 00:00:00')
# tidy up cpi
base_index <- cpi$CPI[1]
cpi <- cpi |>
mutate(Date = yearmonth(cpi$Date),
CPI = CPI/base_index) |>
select(c(Date, CPI))
suppressMessages({
veggies <- veggies %>%
filter(Unit == "Kg") %>%
filter(Commodity %in% c("Tomato Big(Nepali)", "Potato Red",
"Onion Dry (Indian)", "Carrot(Local)",
"Cabbage(Local)")) %>%
mutate(Date = yearmonth(Date)) %>%
select(Commodity, Date, Average) %>%
group_by(Date, Commodity) %>%
summarize(Average = mean(Average)) %>%
ungroup()
})
veggies_ts <- veggies |>
as_tsibble(key = Commodity,
index = Date)
head(veggies_ts, 5)
## # A tsibble: 5 x 3 [1M]
## # Key: Commodity [1]
## Date Commodity Average
## <mth> <chr> <dbl>
## 1 2013 Jun Cabbage(Local) 11.1
## 2 2013 Jul Cabbage(Local) 16.2
## 3 2013 Aug Cabbage(Local) 35.7
## 4 2013 Sep Cabbage(Local) 42.4
## 5 2013 Oct Cabbage(Local) 32.2
This plot shows the average monthly price in Nepalese Rupees of 5 vegetables in Nepal.
veggies_ts |>
autoplot(Average) +
labs(y="Nepalese Rupees",
title="Average price per month of various vegetables in Nepal") +
theme_bw()
The following plot shows the average monthly price of Potato Red in Nepal. The graph shows some spikes around mid year accross all of the years of the observed data, and also shows a high increment in price in 2020.
potato <- veggies_ts |>
filter(Commodity == "Potato Red")
potato |>
autoplot(Average) +
labs(y="Nepalese Rupee",
title="Average Price of Potato Red in Nepal") +
theme_bw()
Moreover, the drastic increase in price in 2020 can be seen in the following plot. 2020 is the year with the highes price of Potato Red in the second semester of the year. In this plot we can also observe how, on average, the second semester of the year sees an increase in price of the potato.
potato |>
gg_season(Average) +
labs(y="Nepalese Rupee",
title="Average Price of Potato Red in Nepal") +
theme_bw()
This plot shows the average price of Potato Red in Nepal per month, as we can observe, the month with the highest price, on average, is October. This higher price coincide with the peak season in Nepal October-November. In this months Nepal experience a surge in tourism due to trekking activities but also coincides with the biggest festival in the Country, Dashin, which lasts for 15 days in September and October. The big spike in prices in 2020 can be attributed to a slump in production as mentioned by https://myrepublica.nagariknetwork.com/news/escalating-prices-of-potato-onion-and-tomato-make-kitchen-expenses-dearer/.
potato |>
gg_subseries(Average) +
labs(y="Nepalese Rupee",
title="Average Price of Potato Red in Nepal") +
theme(axis.text.x = element_blank(), # Hide x-axis text
axis.ticks.x = element_blank())
The price seems to have some seasonal component but also some sort of dependence on the previous months’ prices. The following plot shows the autocorrelation plot of the Potatoes price in Nepal. As seen by the graph the price is highly correlated with the previous month prices and the correlation continues for 3 periods, then decreases.
potato |>
ACF(Average, lag_max = 15) |>
autoplot() +
labs(title="Autocorrelation plot") +
theme_bw()
It’s important to note that economic series tend to increase overtime, specially prices. Therefore, an adjustment is necessary to compare the prices across time. In this analysis the CPI of Nepal will be used as a deflator, the data was obtained from INSERT REFERENCE and the base for comparison is August 2013.
potato_constant_prices <- potato |>
inner_join(cpi, by = join_by(Date)) |>
mutate(Average_constant_prices = Average / CPI) |>
select(Date, Commodity, Average_constant_prices)
head(potato_constant_prices, 5)
## # A tsibble: 5 x 3 [1M]
## # Key: Commodity [1]
## Date Commodity Average_constant_prices
## <mth> <chr> <dbl>
## 1 2013 Aug Potato Red 28.1
## 2 2013 Sep Potato Red 30.3
## 3 2013 Oct Potato Red 36.6
## 4 2013 Nov Potato Red 38.9
## 5 2013 Dec Potato Red 30.8
Visually a representation of the difference between the average price per month and the average price per month with constant prices. As seen by the plot the cpi deflator reduces the distance of the spikes with the lowest points in the graphics, reducing the variability.
p1 <- potato_constant_prices |>
autoplot(Average_constant_prices) +
labs(y="Nepalese Rupees",
title="Potato 2013 prices") +
theme_bw()
p2 <- potato |>
autoplot(Average) +
labs(y="Nepalese Rupee",
title="Average Price of Potato") +
theme_bw()
grid.arrange(p1, p2, ncol = 2)
In the following can be found the difference between the different transformations. The first plot is just the average price divided by the deflator, the second plot is on top of that transformation a box cox transformation and the last plot is on top of division by deflator a log transformation. The Box Cox transformation on top of the constant prices seems to be the best transformation as it somewhat reduces the spikes.
lambda_guerrero_constant_prices <- potato_constant_prices |>
features(Average_constant_prices, features = guerrero) |>
pull(lambda_guerrero)
# Inflation only
p1 <- potato_constant_prices |>
autoplot(Average_constant_prices) +
labs(y="Constant Prices",
title="Average Price of Potato Red, 2013 prices in Nepal") +
theme_bw()
# Box Cox
p2 <- potato_constant_prices |>
autoplot(box_cox(Average_constant_prices, lambda_guerrero_constant_prices)) +
labs(y="Box Cox",
title="Box Cox Transformation of Potato Red, 2013 prices in Nepal") +
theme_bw()
p3 <- potato_constant_prices |>
autoplot(log(Average_constant_prices)) +
labs(y="Log",
title="Average Log Price of Potato Red, 2013 prices in Nepal") +
theme_bw()
grid.arrange(p1, p2, p3, ncol = 1)
The following plot shows in gray the box cox transformation of the average price of Potatoes in Nepal and the STL decomposition of that series, there is no clear trend line upwards or downwards, it looks more as a cyclical pattern, at around midyear of the even years it reaches it’s highest point in the cycle and around midyear of the odd years it reaches the lowest point.
stl_dcmp <- potato_constant_prices |>
model(stl = STL(box_cox(Average_constant_prices, lambda_guerrero_constant_prices)))
potato_constant_prices %>%
autoplot(box_cox(Average_constant_prices, lambda_guerrero_constant_prices), color='gray') +
autolayer(components(stl_dcmp), trend, color='red') +
ylab("Box Cox Transformation of Average Price") +
ggtitle("Box-Cox Transformation Average Price of Potato Red in Nepal") +
theme_bw()
The STL decomposition uses the default window for seasonal pattern of 13 months, making the seasonal pattern the same accross the entire time series, as seen in the season_year plot, every year follows the same pattern around september and october the price increases and at the begining of the year is at the lowest point.
components(stl_dcmp) %>% autoplot()
The following image shows the X-13 Arima decomposition, the main difference is the trend component and the residuals, the Arima model appears to capture more variability in the trend component and the residuals appears to be only withe noise.
seats_dcmp <- potato_constant_prices |>
model(X_13ARIMA_SEATS(box_cox(Average_constant_prices, lambda_guerrero_constant_prices)))
components(seats_dcmp) |>
autoplot()
For forecasting we will create 4 simple models, the seasonal naive, naive, drift and mean. For the mean model, the future values are the mean of the historical data. In the naive model, the forecast is equal to the last value of the time series. Drift is a random walk, last value plus the average change. Finally, the Seasonal Naive, the forecast equals the last value from the same season. We create a mable with the 4 basic models.
four_basic_models <- potato_constant_prices|>
model(
Seasonal_naive = SNAIVE(Average_constant_prices),
Naive = NAIVE(Average_constant_prices),
Drift = RW(Average_constant_prices ~ drift()),
Mean = MEAN(Average_constant_prices))
four_basic_models
## # A mable: 1 x 5
## # Key: Commodity [1]
## Commodity Seasonal_naive Naive Drift Mean
## <chr> <model> <model> <model> <model>
## 1 Potato Red <SNAIVE> <NAIVE> <RW w/ drift> <MEAN>
In the following plot we can see the performance of the models in a 12 month forecast period. It’s evident from the plot that none of the models perform particularity good. There is a clear seasonality in the data, at the second semester of the year the prices increase, the seasonal naive is able to capture that. However, there is no guarantee that the unfortunate slump in production of 2020 will repeat for 2021 towards 2022. It’s also interesting to note that the drift method that consideres the trend is almost the same as the naive, the trend is negative and close to 0.
four_basic_models |>
forecast(h = "12 months") |>
autoplot(potato_constant_prices, level = NULL) +
ggtitle("12 Months forecast for Average Potato Price, 2013 prices.") +
ylab("Nepalese Rupees") +
guides(colour = guide_legend(title = "Forecast")) +
theme_bw()
The following four plots show the model forecasting vs the actual data. It’s evident that the worst model, the one with the largest residuals, is the mean. It does not follow any pattern in the data it just takes the sample mean. The naive and drift models follows the noise of the data closely but with a lag of one period. While the seasonal naive is able to somewhat capture the seasonality of the data but has large errors in 2015, 2018 and 2020 due to changes in the seasonality from one year to the other.
models <- c("Seasonal_naive", "Naive", "Drift", "Mean")
plots <- map(models, ~ four_basic_models |>
select(Commodity, all_of(.x)) |>
augment() |>
ggplot(aes(x = Date)) +
geom_line(aes(y = Average_constant_prices, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
ggtitle(sprintf("Data and %s model fitted", .x)) +
ylab("Nepalese Rupees"))
# Suppress warning messages while printing each plot
walk(plots, ~suppressWarnings(print(.)))
The following four plots shows the residuals of the four basic models. As seen by the plots not a single model reduces the error to white noise. All of the models still exhibit autocorrelation in the errors as seen by the ACF plots, the errors appear to be non-normally distributed, except the seasonal naive model. The conclusion is that these models are not appropriate to model or forecast the time series.
# This chunk could be replaced with a loop if R had actual proper loops instead of weird ones that doesn't weird behavior.
four_basic_models |>
select(Commodity, Seasonal_naive) |>
gg_tsresiduals() +
ggtitle("residuals of Seasonal_naive")
four_basic_models %>%
select(Commodity, Naive) %>%
gg_tsresiduals()+
ggtitle("residuals of Naive")
four_basic_models %>%
select(Commodity, Drift) %>%
gg_tsresiduals()+
ggtitle("residuals of Drift")
four_basic_models %>%
select(Commodity, Mean) %>%
gg_tsresiduals()+
ggtitle("residuals of Mean")
For testing the accuracy of the 4 simple models and compare them, a recursive window method is used, the recursive window will be 18 months, increasing by 1 month. The first table shows the accuracy of the model when forecasting a 12 period window, 12 months and the second table when forecasting only 1 month. The results shows the same conclusion that we reached above by looking at the fitted model vs the data. The drift and naive models are quite similar in results and they are good at forecasting small windows since this two models follows the noise of the data. However, when forecasting longer periods such as 12 months they do not perform as good. On the other hand, the seasonal naive model performs worse than the naive and drift for forecasting small periods of time but perform much better when forecasting longer periods such as 12 months. However, from the residuals we concluded that neither of these methods produces residuals that resembles white noise.
fit_cv <- potato_constant_prices |>
stretch_tsibble(.init = 18, .step = 1) |>
model(
Seasonal_naive = SNAIVE(Average_constant_prices),
Naive = NAIVE(Average_constant_prices),
Drift = RW(Average_constant_prices ~ drift()),
Mean = MEAN(Average_constant_prices))
fc_cv <- fit_cv %>%
forecast(h=12)
suppressMessages({fc_cv %>%
accuracy(potato_constant_prices) %>%
select(.model, .type, RMSE, MAE, MAPE, MASE, RMSSE)})
## # A tibble: 4 × 7
## .model .type RMSE MAE MAPE MASE RMSSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Drift Test 13.3 10.2 38.4 1.43 1.47
## 2 Mean Test 10.4 7.64 28.6 1.07 1.15
## 3 Naive Test 12.5 9.40 35.4 1.32 1.39
## 4 Seasonal_naive Test 8.71 6.80 24.4 0.955 0.965
fc_cv <- fit_cv %>%
forecast(h=1)
suppressMessages({fc_cv %>%
accuracy(potato_constant_prices) %>%
select(.model, .type, RMSE, MAE, MAPE, MASE, RMSSE)})
## # A tibble: 4 × 7
## .model .type RMSE MAE MAPE MASE RMSSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Drift Test 5.58 4.24 15.3 0.596 0.618
## 2 Mean Test 10.3 7.73 30.4 1.09 1.14
## 3 Naive Test 5.54 4.21 15.2 0.590 0.613
## 4 Seasonal_naive Test 9.01 7.08 26.5 0.995 0.998